Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create spam classification tutorial #112

Merged
merged 48 commits into from
Jan 5, 2022
Merged

Create spam classification tutorial #112

merged 48 commits into from
Jan 5, 2022

Conversation

bkmgit
Copy link
Contributor

@bkmgit bkmgit commented Aug 24, 2020

New command line tutorial

@mlpack-bot
Copy link

mlpack-bot bot commented Aug 24, 2020

Thanks for opening your first pull request in this repository! Someone will review it when they have a chance. In the mean time, please be sure that you've handled the following things, to make the review process quicker and easier:

  • All code should follow the style guide
  • Documentation added for any new functionality
  • Tests added for any new functionality
  • Tests that are added follow the testing guide
  • Headers and license information added to the top of any new code files
  • HISTORY.md updated if the changes are big or user-facing
  • All CI checks should be passing

Thank you again for your contributions! 👍

spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
rm labels.txt
```

The next step is to convert all text in the messages to lower case and for simplicity remove punctuation and any symbols that are not spaces, line endings or in the range a-z (one would need expand this range of symbols for production use)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I get the last sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it can be reworded as:

To enable easy comparison of words which will be used as the features, only letters a-z, line endings \n and spaces are used as features. A larger feature set can be helpful, but for small data sets the occurrences of other symbols are not frequent enough to help in classification.

rm messagesLower.txt
```

We now obtain a sorted list of unique words used (this step may take a few minutes, so use nice to give it a low priority while you continue with other tasks on your computer).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I would remove nice as the default behaviour, we could mention it on the side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a low end laptop nice is quite useful to enable other work. On a more powerful machine, the effect will not be to drastic, so that in both cases the code works.

spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
bkmgit and others added 7 commits August 28, 2020 21:12
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
Co-authored-by: Marcus Edel <[email protected]>
@bkmgit
Copy link
Contributor Author

bkmgit commented Aug 28, 2020

Thanks for the review. Can also put all commands in a bash script. Should there be a styleguide for the examples?

@zoq
Copy link
Member

zoq commented Aug 29, 2020

Thanks for the review. Can also put all commands in a bash script. Should there be a styleguide for the examples?

I like the idea, maybe we can split it up into multiple scripts, wondering if we should add a notebook that runs the bash scripts as well? The examples follow: https://github.com/mlpack/mlpack/wiki/DesignGuidelines

@bkmgit
Copy link
Contributor Author

bkmgit commented Aug 30, 2020

Can split it up into multiple scripts. Not so keen on notebooks for this, not so great for integration into production use. For example see rspamd and Spam Assassin perceptron description. For SMS spam Spam Hound is available, but not sure of an open source equivalent. Android does support C++, see documentation here.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @bkmgit, this is really nice! Thank you for taking the time to put this together. I left some comments, mostly simple and stylistic; let me know what you think.

Can split it up into multiple scripts. Not so keen on notebooks for this, not so great for integration into production use.

I agree with what you mean here---most people using mlpack from the command-line won't be using notebooks. It might be simple to add a notebook that just runs the various scripts individually, but I do agree that perhaps in addition to tutorial.md, we should have a number of scripts in the directory that users can run directly. Like, e.g., spam-classification.sh could run everything, and then this could call out to other auxiliary scripts.

My mental model is that the most effective tutorials are ones where users can very quickly and easily run something (e.g. type a command or click 'run' in a notebook cell), and then once they see results, they can dig into the code to understand what's actually going on. So as long as we can structure it in a way such that the above is reasonably true, I think it's great! It may even be "easiest" to structure all the text in tutorials.md as comments in, e.g., some spam-classification.sh file, and then have lower-level comments in other auxiliary scripts? Just tossing the idea out there, don't feel obligated. Maybe there are better ways. :)

spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
rm dataset.csv
rm dataset1.csv
rm dataset.txt
```
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's nice to see other people who do data science with sed, awk, rev, tr, and grep too! 😄

spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Show resolved Hide resolved
spam/tutorial.md Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
spam/tutorial.md Outdated Show resolved Hide resolved
bkmgit and others added 9 commits September 17, 2020 18:59
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Co-authored-by: Ryan Curtin <[email protected]>
Update formatting to make it neater
@rcurtin rcurtin reopened this Jul 30, 2021
@rcurtin
Copy link
Member

rcurtin commented Jul 30, 2021

@bkmgit I think you just need to chmod +x spam_classification.sh. Here's the relevant Travis build line:

/home/travis/.travis/functions: line 109: ./spam_classification.sh: Permission denied

@bkmgit
Copy link
Contributor Author

bkmgit commented Jul 30, 2021

Thanks. Done.

@bkmgit
Copy link
Contributor Author

bkmgit commented Jul 30, 2021

It does seem to run, but times out.

@bkmgit
Copy link
Contributor Author

bkmgit commented Jul 31, 2021 via email

message is then removed and placed in another file.
COMMENT

tr '\r' '\n' < ../data/dataset_sms_spam_bhs_indonesia_v1/dataset_sms_spam_v1.csv > dataset.txt
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary on all systems? I found that the output I got here for dataset.txt when I ran locally was like this:

$ head dataset.txt 
Teks,label

"[PROMO] Beli paket Flash mulai 1GB di MY TELKOMSEL APP dpt EXTRA kuota 2GB 4G LTE dan EXTRA nelpon hingga 100mnt/1hr. Buruan, cek  di tsel.me/mytsel1 S&K",2

2.5 GB/30 hari hanya Rp 35 Ribu Spesial buat Anda yang terpilih. Aktifkan sekarang juga di *550*905#. Promo sd 30 Nov 2015.Buruan aktifkan sekarang. S&K,2

"2016-07-08 11:47:11.Plg Yth, sisa kuota Flash Anda 478KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2

"2016-08-07 11:29:47.Plg Yth, sisa kuota Flash Anda 7160KB. Download MyTelkomsel apps di http://tsel.me/tsel utk cek kuota&beli paket Flash atau hub *363#",2

What's the goal of the line? Maybe we can use dos2unix or something instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added sed '/^$/d' dataset2.csv > dataset.csv to remove the extra lines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regenerating data files. mlpack_preprocess_split fails with an error if one of the labels is a .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I encountered the same thing. I see that labels.csv has one line that's just a ., which can't be parsed:

$ cat labels.csv  | sort |uniq -c
      1 .
    569 0
    574 1

Maybe there is a bug or an extra case that needs to be handled in the preprocessing script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this, incorrect lines were joined. Checking build.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! :) I'm checking to see if it works on my system too. 👍

@bkmgit
Copy link
Contributor Author

bkmgit commented Aug 1, 2021

It seems to run correctly now.

  1. Can add some checks that confusion matrix values are within a difference of 10 from the reported ones.
  2. If the data.csv and labels.csv files are added to the dataset, can modify the writeup so that these only need to be recreated if someone wants to check. This will reduce running time in CI.
  3. Building the command line programs adds some time to the CI build process, but maybe this is ok, willing to add some further tests.

@rcurtin
Copy link
Member

rcurtin commented Aug 1, 2021

Perfect, everything works locally for me. Do you want to revert .travis.yml to its original form, but leave the spam classification example there except commented? Due to travis's limits, we'll have to move to Jenkins, where it doesn't matter if things take a long time. @zoq and I discussed it a bit in #174; maybe we can use a Docker container with all the command-line programs available already.

I don't think it's a problem to leave out data.csv and labels.csv---part of what makes the example interesting, in my opinion, is the command-line data science preprocessing. :)

Can add some checks that confusion matrix values are within a difference of 10 from the reported ones.

Up to you---I think personally as long as the script completes successfully, then everything should be working fine. If there is a failure, probably the mlpack programs will issue a non-zero exit code and then spam_classification.sh will return a non-zero exit code too.

@bkmgit
Copy link
Contributor Author

bkmgit commented Aug 2, 2021 via email

@rcurtin
Copy link
Member

rcurtin commented Aug 5, 2021

Sounds good! I agree on your comments; maybe we should open issues and handle those separately?

@bkmgit
Copy link
Contributor Author

bkmgit commented Nov 17, 2021

@rcurtin Ok, made changes. Will open separate issues.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry it took so long to get back to this. Everything looks good to me! Right now Travis is no longer hooked up to this repository, but I'll open an issue to set up a Jenkins job instead.

Thanks again for the contribution!

@bkmgit
Copy link
Contributor Author

bkmgit commented Jan 3, 2022

Thanks for the feedback. Looking forward to making further contributions.

Copy link

@mlpack-bot mlpack-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second approval provided automatically after 24 hours. 👍

@rcurtin rcurtin merged commit c1e3eeb into mlpack:master Jan 5, 2022
@rcurtin
Copy link
Member

rcurtin commented Jan 5, 2022

Thanks again @bkmgit!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants